mean_pinball_loss (Pinball / Quantile Loss)#
Mean pinball loss (also called pinball loss or quantile loss) measures the average error of quantile predictions.
If you predict an \(\alpha\)-quantile (e.g. \(\alpha=0.9\) for the 90th percentile), pinball loss answers:
“How far off are my quantile predictions, with the correct asymmetric cost?”
It is the standard loss behind quantile regression and a building block for prediction intervals (fit two quantiles, e.g. 0.05 and 0.95).
Learning goals#
Define pinball loss precisely (with math)
Build intuition for the asymmetric penalty controlled by \(\alpha\)
Implement mean_pinball_loss from scratch in NumPy (including weights / multi-output)

Visualize why the minimizer is the \(\alpha\)-quantile
Use mean pinball loss to fit a simple linear quantile regressor (subgradient descent)
Understand pros/cons, pitfalls, and best use cases
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from scipy.stats import norm
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_pinball_loss
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition#
For a target \(y\) and a prediction \(\hat{y}\), the pinball loss at quantile level \(\alpha \in [0, 1]\) is:

\[
\ell_\alpha(y, \hat{y}) = \alpha \max(y - \hat{y},\, 0) + (1 - \alpha) \max(\hat{y} - y,\, 0)
\]

Equivalently, using the residual \(u = y - \hat{y}\):

\[
\ell_\alpha(u) = \max\big(\alpha u,\; (\alpha - 1) u\big)
\]

Or with an indicator:

\[
\ell_\alpha(u) = \big(\alpha - \mathbf{1}[u < 0]\big)\, u
\]

For \(n\) samples, mean pinball loss is:

\[
\mathrm{MPL}_\alpha(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \ell_\alpha(y_i, \hat{y}_i)
\]
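As a quick sanity check (plain NumPy, nothing from scikit-learn), the three equivalent forms of \(\ell_\alpha\) agree numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
y_hat = y + rng.normal(scale=0.5, size=1000)
u = y - y_hat  # residuals

for alpha in [0.1, 0.5, 0.9]:
    f_max = alpha * np.maximum(u, 0) + (1 - alpha) * np.maximum(-u, 0)
    f_tilt = np.maximum(alpha * u, (alpha - 1) * u)  # "tilted" form
    f_ind = (alpha - (u < 0)) * u                    # indicator form
    assert np.allclose(f_max, f_tilt) and np.allclose(f_max, f_ind)
```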
Connection to MAE#
At \(\alpha = 0.5\), the loss is symmetric and proportional to absolute error:

\[
\ell_{0.5}(y, \hat{y}) = \tfrac{1}{2}\, |y - \hat{y}|
\]

So it has the same minimizer as MAE (the median), but differs by a constant factor of \(\tfrac{1}{2}\).
Weighted version#
With non-negative sample weights \(w_i\):

\[
\mathrm{MPL}_\alpha(y, \hat{y}; w) = \frac{\sum_{i=1}^{n} w_i\, \ell_\alpha(y_i, \hat{y}_i)}{\sum_{i=1}^{n} w_i}
\]

The loss is in the same units as \(y\), is always non-negative, and the best value is \(0\).
# A tiny example: alpha controls the asymmetry
y_true = np.array([1.0, 2.0, 3.0])
# under-predict the first sample vs over-predict the last sample
pred_under = np.array([0.0, 2.0, 3.0])
pred_over = np.array([1.0, 2.0, 4.0])
for alpha in [0.1, 0.5, 0.9]:
    loss_under = mean_pinball_loss(y_true, pred_under, alpha=alpha)
    loss_over = mean_pinball_loss(y_true, pred_over, alpha=alpha)
    print(f"alpha={alpha:.1f} | loss(under)={float(loss_under):.4f} | loss(over)={float(loss_over):.4f}")
alpha=0.1 | loss(under)=0.0333 | loss(over)=0.3000
alpha=0.5 | loss(under)=0.1667 | loss(over)=0.1667
alpha=0.9 | loss(under)=0.3000 | loss(over)=0.0333
2) Intuition: a “tilted” absolute error#
Pinball loss is a piecewise-linear function of the prediction \(\hat{y}\).
If you under-predict (\(\hat{y} < y\)), the loss slope is \(-\alpha\).
If you over-predict (\(\hat{y} > y\)), the loss slope is \((1-\alpha)\).
So \(\alpha\) chooses which mistake is more expensive:
\(\alpha=0.9\) (upper quantile): under-prediction is expensive, over-prediction is cheap → pushes predictions upward.
\(\alpha=0.1\) (lower quantile): over-prediction is expensive, under-prediction is cheap → pushes predictions downward.
# Visualize the loss for a single sample
y0 = 2.0
yhat_grid = np.linspace(y0 - 4, y0 + 4, 400)
fig = go.Figure()
for alpha in [0.1, 0.5, 0.9]:
    diff = y0 - yhat_grid
    loss = alpha * np.maximum(diff, 0) + (1 - alpha) * np.maximum(-diff, 0)
    fig.add_trace(go.Scatter(x=yhat_grid, y=loss, mode="lines", name=f"alpha={alpha}"))
fig.add_vline(x=y0, line_dash="dash", line_color="gray")
fig.update_layout(
    title="Pinball loss for one sample (y = 2.0)",
    xaxis_title="prediction ŷ",
    yaxis_title="loss ℓα(y, ŷ)",
)
fig.show()
3) Why it targets quantiles#
Consider a constant predictor \(\hat{y}=c\) (no features).
A key fact:
The value of \(c\) that minimizes the expected pinball loss is an \(\alpha\)-quantile of \(Y\).
Sketch (subgradient): for one sample,
if \(c < y\) then \(\ell_\alpha(y,c) = \alpha(y-c)\) and \(\partial_c \ell_\alpha = -\alpha\)
if \(c > y\) then \(\ell_\alpha(y,c) = (1-\alpha)(c-y)\) and \(\partial_c \ell_\alpha = (1-\alpha)\)
So the expected subgradient at \(c\) is:

\[
\mathbb{E}\big[\partial_c\, \ell_\alpha(Y, c)\big] = (1 - \alpha)\, P(Y < c) - \alpha\, P(Y > c)
\]

The optimum is where this crosses \(0\), which gives the quantile condition \(P(Y \le c) = \alpha\).
In a finite sample, minimizing mean pinball loss over \(c\) yields the sample \(\alpha\)-quantile.
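Before plotting, this fact can be checked directly with a grid search over constant predictions (fresh synthetic data; the match holds up to grid spacing):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.6, size=1000)
c_grid = np.linspace(y.min(), y.max(), 2001)

for alpha in [0.1, 0.5, 0.9]:
    # mean pinball loss for every constant prediction c
    diff = y[:, None] - c_grid[None, :]
    loss = alpha * np.maximum(diff, 0) + (1 - alpha) * np.maximum(-diff, 0)
    c_star = c_grid[loss.mean(axis=0).argmin()]
    # the grid minimizer sits at the sample alpha-quantile
    assert abs(c_star - np.quantile(y, alpha)) < 0.05
```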
# Demo: the best constant predictor is the sample quantile
n = 500
# A skewed distribution to make the shift visible
y = rng.lognormal(mean=0.0, sigma=0.6, size=n)
alphas = [0.1, 0.5, 0.9]
lo, hi = np.quantile(y, [0.01, 0.99])
c_grid = np.linspace(lo, hi, 400)
fig = go.Figure()
for alpha in alphas:
    diff = y[:, None] - c_grid[None, :]
    loss = alpha * np.maximum(diff, 0) + (1 - alpha) * np.maximum(-diff, 0)
    mpl = loss.mean(axis=0)
    q = float(np.quantile(y, alpha))
    c_star = float(c_grid[np.argmin(mpl)])
    fig.add_trace(go.Scatter(x=c_grid, y=mpl, mode="lines", name=f"alpha={alpha}"))
    fig.add_vline(x=q, line_dash="dash", line_color="gray")
    fig.add_vline(x=c_star, line_dash="dot", line_color="gray")
fig.update_layout(
    title="Constant prediction: minimizer is the α-quantile",
    xaxis_title="constant prediction c",
    yaxis_title="mean pinball loss",
)
fig.show()
4) NumPy implementation (from scratch)#
Below is a small implementation that mirrors scikit-learn’s API shape:
supports sample_weight

supports multi-output targets ((n_samples, n_outputs))

supports multioutput='raw_values' or averaging across outputs
def _as_2d(a):
    a = np.asarray(a)
    if a.ndim == 1:
        return a.reshape(-1, 1)
    return a


def mean_pinball_loss_np(
    y_true,
    y_pred,
    *,
    alpha=0.5,
    sample_weight=None,
    multioutput="uniform_average",
):
    y_true = _as_2d(y_true)
    y_pred = _as_2d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
    if not (0.0 <= alpha <= 1.0):
        raise ValueError("alpha must be in [0, 1]")
    diff = y_true - y_pred
    loss = alpha * np.maximum(diff, 0.0) + (1.0 - alpha) * np.maximum(-diff, 0.0)
    # average over samples -> per-output errors
    output_errors = np.average(loss, weights=sample_weight, axis=0)
    if multioutput == "raw_values":
        return output_errors
    if multioutput == "uniform_average":
        return float(np.mean(output_errors))
    # multioutput is array-like weights for outputs
    multioutput = np.asarray(multioutput)
    return float(np.average(output_errors, weights=multioutput))
# Quick equivalence checks with scikit-learn
y_true = rng.normal(size=50)
y_pred = y_true + rng.normal(scale=0.5, size=50)
w = rng.uniform(0.5, 2.0, size=50)
for alpha in [0.1, 0.5, 0.9]:
    a0 = float(mean_pinball_loss(y_true, y_pred, alpha=alpha))
    b0 = mean_pinball_loss_np(y_true, y_pred, alpha=alpha)
    a1 = float(mean_pinball_loss(y_true, y_pred, alpha=alpha, sample_weight=w))
    b1 = mean_pinball_loss_np(y_true, y_pred, alpha=alpha, sample_weight=w)
    print(f"alpha={alpha:.1f} | unweighted diff={a0-b0:+.2e} | weighted diff={a1-b1:+.2e}")
# Multi-output example
Y_true = rng.normal(size=(40, 2))
Y_pred = Y_true + rng.normal(scale=0.3, size=(40, 2))
mean_pinball_loss_np(Y_true, Y_pred, alpha=0.5, multioutput="raw_values")
alpha=0.1 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
alpha=0.5 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
alpha=0.9 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
array([0.1162, 0.1234])
5) Using mean pinball loss to optimize a model (linear quantile regression)#
Let a linear model be:

\[
\hat{y}(x) = w^\top x + b
\]

Quantile regression fits the \(\alpha\)-quantile by minimizing mean pinball loss:

\[
\min_{w,\, b}\; \frac{1}{n} \sum_{i=1}^{n} \ell_\alpha\big(y_i,\; w^\top x_i + b\big)
\]
For linear models this objective is convex, but it is not differentiable at \(y=\hat{y}\). A simple low-level optimizer is subgradient descent.
Subgradient w.r.t. prediction#
For a single sample, with \(u = y - \hat{y}\):

\[
g = \partial_{\hat{y}}\, \ell_\alpha =
\begin{cases}
-\alpha & u > 0 \\
1 - \alpha & u < 0 \\
\text{any value in } [-\alpha,\, 1 - \alpha] & u = 0
\end{cases}
\]

Then the parameter subgradients are:

\[
\partial_w = \frac{1}{n} \sum_{i=1}^{n} g_i\, x_i, \qquad
\partial_b = \frac{1}{n} \sum_{i=1}^{n} g_i
\]

where \(g_i\) is a chosen subgradient for sample \(i\).
def fit_linear_quantile_regression_subgd(
    X,
    y,
    *,
    alpha=0.5,
    lr=0.2,
    n_iters=3000,
    log_every=50,
):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    steps = []
    losses = []
    for t in range(1, n_iters + 1):
        y_pred = X @ w + b
        diff = y - y_pred
        # subgradient wrt prediction ŷ
        g = np.where(diff > 0, -alpha, 1.0 - alpha)
        g = np.where(diff == 0, 0.0, g)  # pick 0 at the kink
        grad_w = (X.T @ g) / n
        grad_b = float(np.mean(g))
        lr_t = lr / np.sqrt(t)  # diminishing step size
        w -= lr_t * grad_w
        b -= lr_t * grad_b
        if t % log_every == 0 or t == 1:
            y_pred = X @ w + b
            losses.append(mean_pinball_loss_np(y, y_pred, alpha=alpha))
            steps.append(t)
    return w, b, np.array(steps), np.array(losses)
# Synthetic heteroscedastic data where the true conditional quantiles are linear
n = 600
x = rng.uniform(0, 10, size=n)
beta0 = 1.0
beta1 = 2.0
sigma0 = 0.5
sigma1 = 0.2 # noise scale increases with x
sigma = sigma0 + sigma1 * x
noise = rng.normal(loc=0.0, scale=sigma, size=n)
y = beta0 + beta1 * x + noise
X = x.reshape(-1, 1)
alphas = [0.1, 0.5, 0.9]
fits = {}
for a in alphas:
    w, b, steps, losses = fit_linear_quantile_regression_subgd(X, y, alpha=a)
    fits[a] = {"w": w, "b": b, "steps": steps, "losses": losses}
{k: (v['w'][0], v['b']) for k, v in fits.items()}
{0.1: (1.7306963005697273, 0.22191021030730798),
0.5: (2.0163591712333595, 0.8064745345330414),
0.9: (2.259015414769386, 1.3579550011540444)}
# Visualize fitted quantile lines
x_grid = np.linspace(x.min(), x.max(), 250)
X_grid = x_grid.reshape(-1, 1)
colors = {0.1: "#1f77b4", 0.5: "#2ca02c", 0.9: "#d62728"}
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers",
        name="data",
        marker=dict(size=6, opacity=0.45),
    )
)
for a in alphas:
    w, b = fits[a]["w"], fits[a]["b"]
    y_hat = X_grid @ w + b
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=y_hat,
            mode="lines",
            name=f"fit α={a}",
            line=dict(color=colors[a], width=3),
        )
    )
# (Optional) true quantile lines (we know the data-generating process)
for a in alphas:
    z = norm.ppf(a)
    y_true_q = (beta0 + z * sigma0) + (beta1 + z * sigma1) * x_grid
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=y_true_q,
            mode="lines",
            name=f"true α={a}",
            line=dict(color=colors[a], width=2, dash="dash"),
            opacity=0.85,
        )
    )
fig.update_layout(
    title="Linear quantile regression trained with mean pinball loss",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()
# Training curves
fig = go.Figure()
for a in alphas:
    fig.add_trace(
        go.Scatter(
            x=fits[a]["steps"],
            y=fits[a]["losses"],
            mode="lines",
            name=f"α={a}",
            line=dict(width=3, color=colors[a]),
        )
    )
fig.update_layout(
    title="Subgradient descent on mean pinball loss",
    xaxis_title="iteration",
    yaxis_title="mean pinball loss",
)
fig.show()
Sanity check: scikit-learn’s QuantileRegressor#
QuantileRegressor solves the same optimization problem (with optional regularization) using linear programming.
We’ll fit it (without regularization) and compare parameters and scores.
sk_fits = {}
for a in alphas:
    model = QuantileRegressor(quantile=a, alpha=0.0, solver="highs")
    model.fit(X, y)
    sk_fits[a] = model

for a in alphas:
    w_subgd, b_subgd = fits[a]["w"][0], fits[a]["b"]
    w_sk, b_sk = sk_fits[a].coef_[0], sk_fits[a].intercept_
    print(
        f"α={a}: subGD w={w_subgd:.3f}, b={b_subgd:.3f} | sklearn w={w_sk:.3f}, b={b_sk:.3f}"
    )
α=0.1: subGD w=1.731, b=0.222 | sklearn w=1.743, b=0.147
α=0.5: subGD w=2.016, b=0.806 | sklearn w=2.010, b=0.842
α=0.9: subGD w=2.259, b=1.358 | sklearn w=2.218, b=1.594
6) Practical usage: scoring quantile predictions#
In practice you typically:
train a model to predict a specific quantile (say \(\alpha=0.9\))
evaluate it with mean_pinball_loss(y_true, y_pred, alpha=0.9)
If you predict multiple quantiles, compute the metric for each \(\alpha\) separately.
# Score subgradient-descent vs sklearn solutions on the same data
for a in alphas:
    y_hat_subgd = X @ fits[a]["w"] + fits[a]["b"]
    y_hat_sklearn = sk_fits[a].predict(X)
    score_subgd = mean_pinball_loss(y, y_hat_subgd, alpha=a)
    score_sklearn = mean_pinball_loss(y, y_hat_sklearn, alpha=a)
    print(f"α={a}: mean_pinball_loss subGD={float(score_subgd):.4f} | sklearn={float(score_sklearn):.4f}")
α=0.1: mean_pinball_loss subGD=0.2667 | sklearn=0.2666
α=0.5: mean_pinball_loss subGD=0.5943 | sklearn=0.5942
α=0.9: mean_pinball_loss subGD=0.2589 | sklearn=0.2564
7) Pros / cons and when to use#
Pros#
Targets quantiles directly → prediction intervals and risk-sensitive forecasting
Asymmetric costs: tune \(\alpha\) to match business penalties (stockouts vs overstock, SLA breaches, etc.)
Robust to outliers compared to squared losses (linear penalty in the tails)
For linear models, the objective is convex (global optimum)
Cons#
Non-smooth at \(y=\hat{y}\) → optimization uses subgradients / linear programming and may be slower
Requires choosing \(\alpha\) (often multiple values)
Not a single-number summary of full predictive uncertainty (you need multiple quantiles)
Fitting multiple quantiles independently can lead to quantile crossing (e.g. \(\hat{q}_{0.9}(x) < \hat{q}_{0.5}(x)\))
Good use cases#
Forecasting with uncertainty bands (delivery times, demand, energy load)
Finance / risk (Value-at-Risk style quantiles)
Any regression where over- vs under-prediction costs differ
8) Pitfalls & diagnostics#
Scale dependence: like MAE, mean pinball loss is in target units; compare across datasets only after scaling/normalizing.
Choose \(\alpha\) intentionally: align it with decisions (e.g. \(\alpha=0.9\) for “plan conservatively”).
Quantile crossing: when fitting multiple \(\alpha\) values, check ordering; consider joint methods that enforce monotonicity.
Coverage check: if you fit lower/upper quantiles, verify empirical coverage (e.g. fraction of points between \(\alpha=0.05\) and \(0.95\) predictions).
9) Exercises#
Implement mean_pinball_loss_np without np.maximum (use only np.where).

For a fixed dataset, plot the constant-prediction loss curves for \(\alpha \in \{0.1, 0.5, 0.9\}\) on the same chart.
Fit two quantile regressors (0.1 and 0.9) and compute the empirical coverage of the resulting interval.
Add L2 regularization to the subgradient descent objective and observe the effect on the fit.
References#
scikit-learn: sklearn.metrics.mean_pinball_loss

Koenker, R. (2005). Quantile Regression. Cambridge University Press.